Abstract. Dropout has recently emerged as a powerful and simple method for
training neural networks, preventing co-adaptation by stochastically omitting
neurons. Dropout is currently not grounded in explicit modelling assumptions, which
has so far precluded its adoption in Bayesian modelling. Using Bayesian entropic
reasoning we show that dropout can be interpreted as optimal inference under
constraints. We demonstrate this on an analytically tractable regression model,
providing a Bayesian interpretation of its mechanism for regularizing and preventing
co-adaptation as well as its connection to other Bayesian techniques. We also
discuss two general approximate techniques for applying Bayesian dropout to general
models, one based on an analytical approximation and the other on stochastic
variational techniques. These techniques are then applied to a Bayesian logistic
regression problem and are shown to improve performance as the model becomes
more misspecified. Our framework roots dropout as a theoretically justified and
practical tool for statistical modelling, allowing Bayesians to tap into the benefits
of dropout training.

Keywords: Bayesian Methods, Dropout, Entropic Reasoning


1 Introduction 


Consider a probabilistic model of a dataset $x$ parameterized by a vector of parameters
$\theta$. Data often contain complicated structure, and so there is a growing interest in
expressive models. An expressive model will have many settings of its parameters which
are compatible with the training data, and if the model is misspecified these settings
will often make different predictions on the test data. This problem is sometimes called
co-adaptation because different coordinates of the parameter vector co-adapt to each
other to give predictions specific to the training data and not the test data (Hinton et al.
2012). A consequence is that there is no guarantee the model will concentrate on parameter
values which give the better prediction on test data (Grünwald and Langford 2007;
Müller).


Dropout, originally proposed for neural networks, provides a powerful way to reduce
undesirable co-adaptation amongst parameters (Hinton et al. 2012). Dropout stochastically
perturbs parameters in the model during training (typically by setting them equal to
zero); this limits the parameters' tendency towards co-adaptation and has been
shown to outperform other state-of-the-art methods (Krizhevsky et al. 2012; Hinton
et al. 2012). Related to dropout are the ideas of bagging and feature corruption, where the
input data is perturbed (Burges and Schölkopf 1997; Maaten et al. 2013; Chen
et al. 2012), thereby increasing the effective size of the training data.
None of these past approaches have a natural Bayesian interpretation but correspond,
in the case of feature corruption, to either enlarging the training set (Burges and
Schölkopf 1997) or minimizing appropriately averaged log-likelihood functions (Maaten
et al. 2013), or, in the case of dropout for neural networks, stochastically minimizing
a changing error function (Hinton et al. 2012). Even though one can easily transform
the minimization of a cost function into maximum likelihood for a distribution by
multiplying by a negative number, exponentiating and normalizing, such ad hoc
probabilities would have no justification from a Bayesian viewpoint, nor would they
provide a natural guideline for devising alternative methods.

Dropout is however not limited to feature corruption but extends to the corruption
of latent variables such as the hidden units in neural networks. Hinton et al. (2012)
discussed how dropout is thereby similar to bagging and easier to implement than Bayesian
model averaging, as exponentially many dropout networks can be approximated by a
single pass of the trained mean network. Wang and Manning (2013) provide an efficient
training procedure for dropout based on approximating the dropout distributions by
Gaussian distributions, and demonstrated how dropout training can be viewed as
maximizing a lower bound of the Bayesian marginal likelihood, marginalized with respect
to these approximating distributions. Dropout has thereby previously been considered
an alternative to Bayesian model averaging, but a key observation is that none of these
methods are Bayesian and they do not allow computation of the probability of
parameters, the model evidence or the predictive density of unobserved data.

In this work we identify the probabilistic method corresponding to dropout and
thereby the natural assumptions under which dropout is the unique optimal assignment
of degrees of rational belief. To do so we exploit that Bayesian inference is a special case
of a more general theory for obtaining rational beliefs, namely the extended method
of maximum entropy (ME) (Caticha 2010; Giffin and Caticha 2007), and this allows
dropout to be incorporated as a principled tool in Bayesian modelling. We demonstrate
how dropout can be implemented in Bayesian linear regression and Bayesian logistic
regression, and when maximizing the resulting likelihood function we recover non-Bayesian
dropout approaches for logistic regression based on loss functions (Wang and Manning
2013; Maaten et al. 2013).


2 Assigning rational beliefs 


Machine learning consists in constructing systems which can predict, explain and control
their environment in a rational manner based on information and past beliefs, even
though the information available is often incomplete. R. T. Cox convincingly showed that
a theory of degrees of rational belief is only consistent if the degrees of belief are
isomorphic to probabilities and obey the sum and product rules of probability theory
(Cox 1946). This insight justifies the use of probabilities to represent states of belief
and has given rise to Bayesian methods, where the past beliefs are identified with the
prior beliefs about the variables of interest and the available information is the observed
data and the relationship between data and parameters. For this reason Bayesian methods
are often

T. Herlau, M. Mørup and M. N. Schmidt




identified with the application of Bayes theorem

$$ p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \tag{1} $$

to arrive at a posterior distribution on the parameters of interest, $\theta$. When Bayes
theorem is applied with this interpretation it is sometimes called Bayes rule.

The important distinction we wish to draw is that Bayesian inference is, rightly
understood, a consistency requirement on assignments of rational belief (namely the
usual sum and product rules) and not the normative statement that the only rational way
to assign posterior beliefs about the parameters $\theta$ in the presence of data is by
applying Bayes rule, eq. (1). That there is a distinction between how to obtain beliefs
and Bayes theorem is obvious: Bayes theorem only tells us how consistency allows us to
express one probability in terms of others, and it should apply regardless of what those
other probabilities are. At some point one must specify, based on arguments other than
probability theory, both a model (the likelihood) and a prior distribution to perform
meaningful analysis. To take the simplest case, nothing in probability theory informs us
of the numerical value of the probability that a coin will come up heads in the next toss,
only that the probability it will come up heads, $p(\text{Heads})$, is related to
$p(\text{Tails})$ through


$$ p(\text{Heads}) + p(\text{Tails}) = 1. \tag{2} $$

To arrive at the obvious answer $p(\text{Heads}) = p(\text{Tails}) = \tfrac{1}{2}$ requires
additional arguments (Jaynes 2003).


Based on this observation, we argue that the problem of reasoning in the presence of data
is one of obtaining beliefs (i.e. a probability distribution) over the set of parameters
$\theta$, and does not automatically imply the application of Bayes rule, eq. (1).

In many situations, the relevant information that guides how we obtain beliefs is not
in the form of data and probabilistic models, but in the form of constraints.
A canonical example is an ideal gas, where the relevant constraints include energy
conservation and invariance under rigid coordinate transformations. The method of maximum
entropy, MaxEnt, allows assignment of rational beliefs under these constraints (Jaynes).


A more general system of rational inference must be able to handle both types
of information, i.e. observed data and arbitrary constraints, in an objective manner.
If these constraints are available to us in the form of a model (a constraint on the
class of posterior functions) and observed data, the method must reduce to Bayesian
inference. If the constraint is in the form of expectations, such as an energy constraint,
it must reduce to MaxEnt. This can be accomplished by the extended method of
maximum entropy (ME), which allows rational inference under both types of constraints
and contains both MaxEnt and Bayes rule as special cases (Giffin and Caticha 2007;
Caticha 2010).
2.1 Extended method of maximum entropy

Denote all variables of interest by $z$. In the context of the previous section, $z = (x, \theta)$.
One way to phrase the goal for a rational learner is to update from a prior distribution
$q(z)$ to a posterior distribution $p(z)$ when new information is made available.

The relevant information can come in the form of observed data, priors, the form
of the likelihood or other restrictions, all of which constrain the posterior distribution
to belong to a family of distributions, $p \in \mathcal{C}$, fulfilling these constraints.
The method of ME makes the assumption that not all distributions $p \in \mathcal{C}$ are
equally desirable and assumes they can be put in an order of preference. Since an order of
preference is transitive and must depend on our past beliefs, we can write it as a
functional $S[p, q]$. The interpretation of $S$ is that for any two distributions
$p_1, p_2 \in \mathcal{C}$, if

$$ S[p_1, q] > S[p_2, q] \tag{3} $$

we say $p_1$ is preferable to $p_2$ on the grounds of $q$. Thus the distribution $p$ one
should choose is the most preferred distribution in $\mathcal{C}$, the set of allowed
distributions. That is to say, given an order of preference $S$, learning consists of the
computation

$$ p = \arg\max_{p' \in \mathcal{C}} S[p', q]. \tag{4} $$


When the problem is posed this way, one can pose additional desiderata that such an order
of preference $S$ must fulfill (locality, coordinate invariance and consistency for
independent subsystems) and use these to limit the possible choices to a single functional
form. This type of derivation is entirely parallel to the development of probability
theory in Cox (1946), where what was being analysed was degrees of rational belief. The
project of analysing an order of preference was first posed and undertaken by Shore and
Johnson (1980); however, the proof contains a critical flaw, as pointed out by Uffink
(1995). The derivation we rely on is due to Giffin and Caticha (2007), who argue that under
natural assumptions $S$ must have the form of the negative Kullback-Leibler divergence
(the same form as argued for by Shore and Johnson (1980)):

$$ S[p, q] = -\int dz\; p(z) \log \frac{p(z)}{q(z)}. \tag{5} $$


In a machine-learning context where we divide $z$ into a set of observed data points $x$
and parameters $\theta$, the relevant information is that we observed an actual value of
the data, $x'$, and this places the constraint on the family $\mathcal{C}$ that

$$ \mathcal{C} = \left\{ p \;\middle|\; \int d\theta\; p(x, \theta) = \delta(x - x') \right\}, \tag{6} $$

where $\delta$ is the delta function. Maximizing eq. (5) under this constraint exactly
recovers Bayes' rule (Giffin and Caticha 2007); however, the method is more general than
Bayesian inference in that it allows arbitrary constraints.
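The claim that the data constraint of eq. (6) recovers Bayes' rule can be checked numerically in a discrete setting. The following is a minimal sketch, assuming a hypothetical 2 x 2 prior table `q` and an observed index `x_obs` (neither is from the paper): every distribution satisfying the constraint has the form $p(x, \theta) = p(\theta)\,[x = x']$, so it suffices to scan over $p(\theta)$ and compare the entropy of eq. (5) against the Bayes posterior.

```python
import math

# Hypothetical prior joint q(x, theta) on a 2 x 2 grid: rows index x, columns index theta.
q = [[0.10, 0.30],
     [0.40, 0.20]]
x_obs = 1  # index of the observed value x'


def bayes_posterior(q, x_obs):
    """Bayes' rule: q(theta | x') = q(x', theta) / sum_theta q(x', theta)."""
    row = q[x_obs]
    z = sum(row)
    return [v / z for v in row]


def entropy(p_theta, q, x_obs):
    """S[p, q] of eq. (5) for p(x, theta) = p_theta(theta) * [x == x']."""
    return -sum(p * math.log(p / q[x_obs][j])
                for j, p in enumerate(p_theta) if p > 0)


posterior = bayes_posterior(q, x_obs)
s_bayes = entropy(posterior, q, x_obs)

# Scan candidates p(theta) = (a, 1 - a) that satisfy the data constraint.
grid = [i / 1000 for i in range(1, 1000)]
scores = [entropy([a, 1 - a], q, x_obs) for a in grid]
a_best = grid[scores.index(max(scores))]
```

In this toy case no candidate on the grid scores higher than the Bayes posterior, and the maximizing `a_best` agrees with `posterior[0]` up to the grid resolution.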
Since dropout was originally introduced in the context of neural networks (Hinton et al.
2012), we too will introduce it in this context. In the neural network formulation of
dropout we consider a function (the neural network) parameterized by a set of weights
$\theta$ which maps inputs $X$ to outputs $y$ as $y = f_\theta(X)$. In the following the
task will be binary classification, so the output $y$ is binary. The simplest way to train
the network is by gradient descent with respect to $\theta$ on the quadratic loss summed
over all input and output training pairs $x = (X_i, y_i)_{i=1}^{n}$:




$$ E(\theta) = \sum_{i=1}^{n} \left( y_i - f_\theta(X_i) \right)^2. \tag{7} $$


Dropout is a simple extension where, between each gradient descent step and for each
observation $X_i$, a perturbed set of parameters $\tilde\theta_i \sim p(\cdot \mid \theta)$
is generated and a single gradient update is performed on the modified error function




$$ E_i(\theta) = \left( y_i - f_{\tilde\theta_i}(X_i) \right)^2. \tag{8} $$


The perturbed set of parameters is obtained by simply blanking out a fraction (typically
half) of the hidden units independently (Hinton et al. 2012). At the next iteration,
a new set of perturbed weights is chosen and the procedure is repeated. In the limit
of a low learning rate on $\theta$, dropout will favor weights $\theta$ that tend to give
good performance when a fraction of the inputs is missing, thereby reducing co-adaptation
(Hinton et al. 2012). Another way to view dropout is as an efficient way to do model
averaging, where the average runs over all possible perturbations of the parameters. While
this feature of dropout alludes to Bayesian model averaging, there is the key difference that
dropout does not favor or learn any particular perturbation of the parameters, whereas
a standard Bayesian formulation of dropout would.
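The training procedure described above can be sketched for the simplest possible case. The code below is an illustration only, assuming a linear predictor $f_\theta(X) = \theta^T X$ in place of a neural network and dropout applied directly to the weights; the function name `dropout_sgd`, the learning rate and the toy data are all hypothetical.

```python
import random


def dropout_sgd(X, y, drop_rate=0.5, lr=0.01, epochs=200, seed=0):
    """SGD on the per-observation dropout loss (y_i - f(X_i; theta_i~))^2 for a linear f."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Draw a fresh perturbed parameter vector theta_i~ for this observation.
            mask = [0.0 if rng.random() < drop_rate else 1.0 for _ in theta]
            pred = sum(m * t * v for m, t, v in zip(mask, theta, x_i))
            err = pred - y_i
            # Gradient step with respect to the unperturbed weights theta.
            theta = [t - lr * 2.0 * err * m * v for t, m, v in zip(theta, mask, x_i)]
    return theta


# Toy data with two identical covariates, so the weights can trade off freely.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta_plain = dropout_sgd(X, y, drop_rate=0.0)
theta_drop = dropout_sgd(X, y, drop_rate=0.5)
```

With `drop_rate=0.0` the procedure reduces to ordinary stochastic gradient descent, and the two duplicated weights settle on values that sum to 2. With dropout, each weight must also predict well when the other is missing, so the solution changes.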

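To be precise, consider a Bayesian equivalent of a neural network model with joint
distribution $p(x, \theta) = p(x \mid \theta)\, p(\theta)$ defined by the generative process

$$ \theta \sim p(\cdot), \tag{9a} $$
$$ \text{for each } i:\quad y_i \mid \theta \sim \mathrm{Normal}\left(\cdot\,; f_\theta(X_i), \sigma^2\right). \tag{9b} $$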

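It is easy to see that, assuming flat priors, the MAP solution to this model will be exactly
equivalent to the global minimum of the neural-network error, eq. (7). We might consider
adding a dropout step to the generative model:

$$ \theta \sim p(\cdot), \tag{10a} $$
$$ \text{for each } i:\quad \tilde\theta_i \mid \theta \sim p(\cdot \mid \theta), \tag{10b} $$
$$ y_i \mid \tilde\theta_i \sim \mathrm{Normal}\left(\cdot\,; f_{\tilde\theta_i}(X_i), \sigma^2\right). \tag{10c} $$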

While analytically related, this model does something very different from dropout because
it infers exactly which weights are best to blank out to explain each observation
and does not produce a distribution over the weights which attempts to be robust to having
weights removed, in the sense of the minimum of the dropout error, eq. (8). The
effect of this will be twofold. Firstly, it will create a mixture of models, each model
(corresponding to a set of dropped-out parameters) being allowed to co-adapt to the
data. Secondly, the mixture will be weighted with the posterior probability, reducing
the effective number of components.

To implement the method parallel to dropout, we must specify dropout as an external
constraint on the learner's representation, namely that the weights are zeroed
stochastically. To put this succinctly:

Bayesian Dropout: The dropout distribution $\tilde\theta_i \mid \theta$ is not inferred from data. (11)

To implement this using ME, we first need to fix the prior measure $q$ and the relevant
constraints. As prior distribution $q$ we adopt the same functional form as the
naive Bayesian dropout model, eq. (10). Collecting all the perturbed weights as
$\tilde\theta = (\tilde\theta_i)_{i=1}^{n}$, the joint distribution may be written

$$ q(x, \tilde\theta, \theta) = q(x \mid \tilde\theta)\, q(\tilde\theta \mid \theta)\, q(\theta). \tag{12} $$


The key point is that since no observations have been made, the prior beliefs reflect the
uncertainty in both $x$ and $\theta$.


Next we specify the constraints. As in the standard application of ME, we have the data
constraint that we observed an actual value of $x$, namely $x'$ (see eq. (6)). Secondly, we
have the dropout condition, eq. (11). To say that weights are dropped out stochastically and
that this distribution is not inferred from data is to say exactly that the dropout weights
must depend on $\theta$ but not on $x$, and otherwise reflect the prior distribution. We can
then identify Bayesian Dropout (as formulated in eq. (11)) with the constraint

$$ p(\tilde\theta \mid x, \theta) = q(\tilde\theta \mid \theta) \qquad \text{(Bayesian Dropout)} \tag{13} $$


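Accordingly we have $p(x, \tilde\theta, \theta) = p(\tilde\theta \mid \theta)\, p(x, \theta)$.
The ordering (i.e., preference according to ME) of distributions becomes:

$$ \begin{aligned}
S[p, q] &= -\int dx\, d\theta\, d\tilde\theta\; p(x, \tilde\theta, \theta) \log \frac{p(x, \tilde\theta, \theta)}{q(x, \tilde\theta, \theta)} \\
&= -\int dx\, d\theta\, d\tilde\theta\; p(x, \theta)\, p(\tilde\theta \mid \theta) \log \frac{p(x, \theta)}{q(x \mid \tilde\theta)\, q(\theta)}.
\end{aligned} \tag{14} $$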

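The distribution which uniquely maximizes this functional can be found by taking the
functional derivative with respect to $p(x, \theta)$ while introducing Lagrange multipliers
$\lambda_x$ to handle the (infinite) number of data constraints, eq. (6), and $\alpha$ to
handle the normalization (sum) constraint. This leads to the variational problem:

$$ 0 = \frac{\delta}{\delta p(x, \theta)} \left\{ S[p, q]
 + \alpha \left[ \int dx\, d\theta\, d\tilde\theta\; p(x, \tilde\theta, \theta) - 1 \right]
 + \int dx\; \lambda_x \left[ \int d\theta\, d\tilde\theta\; p(x, \tilde\theta, \theta) - \delta(x - x') \right] \right\}. \tag{15} $$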
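For readers who want the intermediate step, the stationarity condition can be written out explicitly. This is a sketch that uses only the dropout constraint of eq. (13), so that $p(\tilde\theta \mid \theta) = q(\tilde\theta \mid \theta)$ and this factor integrates to one over $\tilde\theta$:

$$ 0 = -\int d\tilde\theta\; q(\tilde\theta \mid \theta) \left[ \log \frac{p(x, \theta)}{q(x \mid \tilde\theta)\, q(\theta)} + 1 \right] + \alpha + \lambda_x $$

$$ \Longrightarrow \quad \log p(x, \theta) = \log q(\theta) + \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x \mid \tilde\theta) + \lambda_x + \alpha - 1. $$

The constant $\alpha - 1$ does not depend on $x$ or $\theta$ and can therefore be absorbed into a normalization constant.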
Performing the functional derivative and solving for $p(x, \theta)$ gives

$$ p(x, \theta) = \frac{1}{Z}\, q(\theta) \exp\left[ \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x \mid \tilde\theta) + \lambda_x \right], \tag{16} $$


where $Z$ handles normalization. Recall the requirement that the posterior is consistent
with the observed data $x'$:

$$ \int d\theta\; p(x, \theta) = \delta(x - x'). \tag{17} $$

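Using this identity on the right-hand side of eq. (16) to fix the Lagrange multipliers
$\lambda_x$ gives

$$ p(x, \theta) = \frac{1}{Z(x)}\, q(\theta) \exp\left( \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x \mid \tilde\theta) \right) \delta(x - x'), \tag{18} $$

$$ Z(x) = \int d\theta\; q(\theta) \exp \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x \mid \tilde\theta), \tag{19} $$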


such that the distribution of the parameters alone becomes

$$ p(\theta) = \frac{q(\theta)}{Z(x')} \exp\left( \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x' \mid \tilde\theta) \right). \tag{20} $$

If $q(\tilde\theta \mid \theta) = \delta(\tilde\theta - \theta)$ the expression reduces to
Bayes' theorem (keep in mind that $p$ refers to our final state of knowledge, which is why
we do not condition on $x'$). Otherwise, if any particular $\theta$ is such that a single
dropped-out version $\tilde\theta$ of $\theta$ performs very poorly (even though other
dropped-out versions give good performance), it will be more heavily penalized
due to the log term.
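For very small models the target of eq. (20) can be evaluated exactly by enumerating all dropout masks. The sketch below does this on a finite parameter grid, assuming a hypothetical two-weight linear model with unit noise variance, a Gaussian prior, and an independent mask per observation; the data and helper names are illustrative and not from the paper.

```python
import itertools
import math


def log_normal(y, mean, var=1.0):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (y - mean) ** 2 / var


def expected_loglik(theta, X, y, drop_rate):
    """Inner expectation of eq. (20): sum_i E_mask[log q(y_i | mask * theta)]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        for mask in itertools.product([0, 1], repeat=len(theta)):
            kept = sum(mask)
            prob = (1 - drop_rate) ** kept * drop_rate ** (len(theta) - kept)
            if prob == 0.0:
                continue
            mean = sum(m * t * v for m, t, v in zip(mask, theta, x_i))
            total += prob * log_normal(y_i, mean)
    return total


def dropout_target(grid, log_prior, X, y, drop_rate):
    """p(theta) proportional to q(theta) exp(E[log q(x' | theta~)]), normalized on the grid."""
    logs = [log_prior(th) + expected_loglik(th, X, y, drop_rate) for th in grid]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical data and a coarse grid over two weights.
X = [[1.0, 1.0], [1.0, -1.0], [2.0, 0.5]]
y = [1.0, 0.2, 1.5]
grid = list(itertools.product([-1.0, 0.0, 1.0, 2.0], repeat=2))


def log_prior(th):
    return sum(log_normal(t, 0.0, 4.0) for t in th)


p_bayes = dropout_target(grid, log_prior, X, y, drop_rate=0.0)
p_drop = dropout_target(grid, log_prior, X, y, drop_rate=0.5)
```

With `drop_rate=0.0` only the all-ones mask has nonzero probability, so `p_bayes` is the ordinary Bayes posterior on the grid, matching the delta-function special case noted above. The enumeration costs $2^p$ terms per observation, which is why this direct approach does not scale beyond toy models.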


4 Inference 


Although the Bayesian dropout target, eq. (20), is a closed-form proper probability
distribution, the inner expectation makes the target infeasible to evaluate by explicit
summation for all but the simplest models. For a few models it will be possible to compute
the sum analytically and obtain an expression which can be sampled or approximated using
ordinary techniques; we will return to one such example, Bayesian linear regression,
in the simulations. However, to propose Bayesian dropout as a general technique requires
general tools for inference.

In this section we briefly discuss two strategies for the case where the Bayesian
dropout target is infeasible to evaluate analytically or by summation. The first technique
is analytical approximation of the innermost expectation to arrive at a closed-form target
which can be sampled. Although generally applicable, the feasibility and quality of
this scheme depend on the specifics of the model. The second proposal is a variational
approximation scheme using stochastic variational Bayes (Hoffman et al. 2013; Salimans
and Knowles 2012). The quality of this scheme depends on the flexibility of the class
of variational distributions; however, as we will see, the method is otherwise exact and no
harder to implement than ordinary stochastic variational Bayes for all common dropout
distributions $q(\tilde\theta \mid \theta)$. In other words, if the class of variational
distributions contains the true posterior, eq. (20), the variational approximation will be
exact and, at least in principle, not more expensive to compute than ordinary stochastic
variational Bayes.


4.1 Analytical Approximation 

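The first scheme we will consider is analytical approximation of the dropout target.
Consider the case where the innermost expectation takes the form

$$ \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x \mid \tilde\theta)
 = \int d\tilde\theta\; q(\tilde\theta \mid \theta) \sum_i \log q(x_i \mid h_i(\tilde\theta)). \tag{21} $$

That is, we assume the observations are independent conditional on a function of the set
of parameters. Assuming the number of dimensions is large, and for simplicity assuming
the value of the function $h_i$ is a scalar, we can consider the value of $h_i(\tilde\theta)$
(conditional on $\theta$) as being approximately normally distributed with mean and variance
$(\mu_\theta, \sigma_\theta^2)$. Accordingly the above integral can be approximated as

$$ \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x_i \mid h_i(\tilde\theta))
 \approx \mathbb{E}_{h \sim \mathcal{N}(\mu_\theta, \sigma_\theta^2)} \left[ \log q(x_i \mid h) \right], \tag{22} $$

which is the type of approximation we will use in the simulations. If the right-hand side of
eq. (22) is not available in closed form, an even more general approximation can be obtained
by Taylor expanding eq. (22) around the mean $\mu_\theta$ to obtain

$$ \int d\tilde\theta\; q(\tilde\theta \mid \theta) \log q(x_i \mid h_i(\tilde\theta))
 \approx \log q(x_i \mid \mu_\theta)
 + \frac{\sigma_\theta^2}{2} \left. \frac{\partial^2 \log q(x_i \mid h)}{\partial h^2} \right|_{h = \mu_\theta}. \tag{23} $$

However, we will only need to make use of eq. (22) in the simulations.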
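To get a feeling for the quality of these two approximations, they can be compared against exact enumeration on a small example. The sketch below assumes a logistic log-likelihood $\log q(x_i \mid h) = \log \sigma(h)$ for a positive label, $h = X_i^T \tilde\theta$, and independent binary dropout; the weights, covariates and the simple numerical integration routine are illustrative choices, not the paper's.

```python
import itertools
import math


def log_sigmoid(h):
    """Numerically stable log(1 / (1 + exp(-h)))."""
    if h >= 0:
        return -math.log1p(math.exp(-h))
    return h - math.log1p(math.exp(h))


def exact_expectation(theta, x_i, drop_rate):
    """Left-hand side of eq. (22): enumerate every binary dropout mask."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(theta)):
        kept = sum(mask)
        prob = (1 - drop_rate) ** kept * drop_rate ** (len(theta) - kept)
        h = sum(m * t * v for m, t, v in zip(mask, theta, x_i))
        total += prob * log_sigmoid(h)
    return total


def moments(theta, x_i, drop_rate):
    """Mean and variance of h = x_i . theta~ under independent binary dropout."""
    mean = (1 - drop_rate) * sum(t * v for t, v in zip(theta, x_i))
    var = drop_rate * (1 - drop_rate) * sum((t * v) ** 2 for t, v in zip(theta, x_i))
    return mean, var


def gaussian_expectation(mean, var, n=2001, width=8.0):
    """Right-hand side of eq. (22) via a Riemann sum over +/- 8 standard deviations."""
    sd = math.sqrt(var)
    step = 2 * width * sd / (n - 1)
    total = 0.0
    for k in range(n):
        h = mean - width * sd + k * step
        density = math.exp(-0.5 * ((h - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        total += log_sigmoid(h) * density * step
    return total


def taylor_expectation(mean, var):
    """Eq. (23), using d^2/dh^2 log sigmoid(h) = -s(h) (1 - s(h))."""
    s = 1.0 / (1.0 + math.exp(-mean))
    return log_sigmoid(mean) - 0.5 * var * s * (1.0 - s)


theta = [0.3] * 10   # hypothetical weights
x_i = [1.0] * 10     # hypothetical covariates
mu, var = moments(theta, x_i, drop_rate=0.5)
exact = exact_expectation(theta, x_i, drop_rate=0.5)
gauss = gaussian_expectation(mu, var)
taylor = taylor_expectation(mu, var)
```

For this configuration of ten equal weights, the three values agree closely, which is what the central-limit argument suggests when many terms of similar size contribute to $h$. The agreement can be expected to degrade when a few weights dominate the sum.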


4.2 Stochastic Variational Approximation 


Variational methods provide a popular tool for obtaining approximations to intractable
posterior distributions (Jordan et al. 1999), sometimes at dramatically lower cost than
Monte Carlo (Honkela et al. 2010). Variational methods work by finding an approximating
distribution (from some convenient class of distributions) which minimizes the
Kullback-Leibler divergence to the intractable target distribution; however, this has
traditionally limited variational methods to the case where certain integrals involved in the
minimization are analytically computable.

Recently stochastic variational Bayes has been proposed as an optimization scheme
for variational methods (Hoffman et al. 2013; Paisley et al. 2012). Rather than seeking
analytically optimal projections of parts of the variational approximation onto the target
distribution, one instead tries to iteratively optimize the parameters of the variational
approximation. Since this does not require analytical solutions to optimization
problems, it dramatically enriches the space of variational distributions which provide
tractable inference and allows one, for instance, to use a variational family consisting of
mixtures of distributions.

More importantly for our application, stochastic variational Bayes provides an easy
and general way to perform variational inference for distributions where
$q(\tilde\theta \mid \theta)$ in eq. (20) is only required to allow an efficient sampling
scheme. To see this, and because we will consider some simple modifications in the
experiments, we will briefly review stochastic variational Bayes following the presentation
by Salimans and Knowles (2012).


Stochastic Variational Bayes 

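Recall that $x$ denotes the set of observed data and $\theta$ the set of relevant parameters
of the model. The basic idea in variational methods is to approximate the posterior
distribution $p(\theta \mid x)$ by an approximate distribution $r_\eta(\theta)$ (from a family
of distributions parameterized by $\eta$), where the optimal parameter is found by minimizing
the Kullback-Leibler divergence (Jordan et al. 1999)

$$ \hat\eta = \arg\min_{\eta} \mathrm{KL}\left( r_\eta(\theta) \,\|\, p(x, \theta) \right). \tag{24} $$

In the following we assume the variational distribution belongs to the un-normalized
exponential family

$$ \tilde r_{\tilde\eta}(\theta) = \exp\left[ \tilde T(\theta)\, \tilde\eta \right] \nu(\theta), \tag{25} $$

where $\tilde T(\theta)$ is related to the ordinary sufficient statistics $T(\theta)$ by adding
a constant, $\tilde T(\theta) = [1, T(\theta)]$, and $\tilde\eta$ is the set of variational
parameters with the normalization absorbed, $\tilde\eta^T = [\eta_0, \eta^T]$. The relevant
quantity to minimize is the un-normalized Kullback-Leibler divergence:

$$ \mathrm{KL}\left( \tilde r_{\tilde\eta}(\theta) \,\|\, p(x, \theta) \right)
 = \int d\nu(\theta)\; \tilde r_{\tilde\eta}(\theta) \left[ \log \frac{\tilde r_{\tilde\eta}(\theta)}{p(x, \theta)} - 1 \right]. \tag{26} $$

Taking the derivative and solving gives

$$ 0 = \nabla_{\tilde\eta}\, \mathrm{KL}\left( \tilde r_{\tilde\eta}(\theta) \,\|\, p(x, \theta) \right)
 = \int d\nu(\theta)\; \tilde r_{\tilde\eta}(\theta) \left[ \tilde T(\theta)^T \tilde T(\theta)\, \tilde\eta - \tilde T(\theta)^T \log p(x, \theta) \right], \tag{27} $$

implying

$$ \tilde\eta = \left[ \int d\nu(\theta)\; \tilde r_{\tilde\eta}(\theta)\, \tilde T(\theta)^T \tilde T(\theta) \right]^{-1}
 \left[ \int d\nu(\theta)\; \tilde r_{\tilde\eta}(\theta)\, \tilde T(\theta)^T \log p(x, \theta) \right]. \tag{28} $$

If we introduce the vector $g = \mathbb{E}_{\tilde r}[\tilde T(\theta)^T \log p(x, \theta)]$ and
the symmetric matrix $C = \mathbb{E}_{\tilde r}[\tilde T(\theta)^T \tilde T(\theta)]$, eq. (28)
simplifies to the linear regression-type problem:

$$ \tilde\eta = C^{-1} g. \tag{29} $$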
Algorithm 1 Stochastic variational Bayes with dropout

Initialize initial guesses of parameters $\tilde\eta_1$, $C_1$ and $g_1$
Initialize aggregated guess of variational distribution $\bar C = 0$, $\bar g = 0$
Initialize step-size $\epsilon$
for $t = 1, \ldots, N$ do
    Simulate $S$ i.i.d. draws $\theta_{t1}, \ldots, \theta_{tS}$ from the current approximating distribution $\tilde r_{\tilde\eta_t}(\theta)$
    For each $\theta_{ts}$ generate the perturbed (dropped-out) version $\tilde\theta_{ts}$ of the parameters by
    applying binary corruption with rate $f$
    Using eq. (31), compute the unbiased estimators:
        $\hat g_t = \frac{1}{S} \sum_{s=1}^{S} \tilde T(\theta_{ts})^T \log q(x \mid \tilde\theta_{ts})\, q(\theta_{ts})$
        $\hat C_t = \frac{1}{S} \sum_{s=1}^{S} \tilde T(\theta_{ts})^T \tilde T(\theta_{ts})$
    Update current (averaged) parameter settings:
        $g_{t+1} = (1 - \epsilon)\, g_t + \epsilon\, \hat g_t$
        $C_{t+1} = (1 - \epsilon)\, C_t + \epsilon\, \hat C_t$
        $\tilde\eta_{t+1} = C_{t+1}^{-1}\, g_{t+1}$
    if $t > N/2$ then
        $\bar g = \bar g + \hat g_t$
        $\bar C = \bar C + \hat C_t$
    end if
end for
return $\hat\eta = \bar C^{-1} \bar g$


Since both $C$ and $g$ depend on $\tilde\eta$, the expression is not a closed-form solution,
nor does it avoid problematic integrals. However, standard arguments from numerical
optimization, such as those given by Salimans and Knowles (2012), show that updating
$\tilde\eta$ iteratively using eq. (29) will result in a convergent procedure even if one
replaces the quantities $C$ and $g$ by unbiased estimators $\hat C$, $\hat g$. A natural choice
is to sample $S$ values of $\theta$ from $\tilde r_{\tilde\eta}$, where $S$ can be as low as 1.
The possibility of replacing these quantities with unbiased estimators was exploited by
Hoffman et al. (2013) by subsampling the data $x$; however, the method is also well suited
for doubly stochastic problems, since with the logarithm of eq. (20) inserted, $g$ takes
the form:

$$ g = \int d\theta\, d\tilde\theta\; \tilde r_{\tilde\eta}(\theta)\, q(\tilde\theta \mid \theta)\, \tilde T(\theta)^T \log q(x \mid \tilde\theta)\, q(\theta). \tag{30} $$


An unbiased estimator can be obtained by sampling $S$ values $\theta_s \sim \tilde r_{\tilde\eta}$
and, conditional on each sampled value, sampling $\tilde\theta_s$ from
$q(\tilde\theta \mid \theta_s)$ and evaluating the expectation. While it is possible to
approximate the expectation over $\tilde\theta$ with more than one $\tilde\theta$ value
for each $\theta_s$, we choose to use one for simplicity. Accordingly unbiased estimators are
obtained as

$$ \hat g = \frac{1}{S} \sum_{s=1}^{S} \tilde T(\theta_s)^T \log q(x \mid \tilde\theta_s)\, q(\theta_s), \tag{31a} $$

$$ \hat C = \frac{1}{S} \sum_{s=1}^{S} \tilde T(\theta_s)^T \tilde T(\theta_s). \tag{31b} $$

Implementing stochastic variational Bayes requires specifying the variational family,
iteratively applying eq. (31) to compute $\hat g$ and $\hat C$, and computing the current
variational parameters $\tilde\eta$ using eq. (29). The full algorithm is then simply a
particular instance of Algorithm 1 of Salimans and Knowles (2012), and this reference can be
consulted for proofs of convergence, etc. For learning rate $\epsilon$ and in our notation it
is given in Algorithm 1; the output of the algorithm is the parameter vector $\hat\eta$ of
the variational distribution.
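The estimators of eq. (31) and the regression step of eq. (29) can be illustrated on a case where the answer is known. The sketch below is not the full Algorithm 1: it evaluates a single regression step at a fixed sampling distribution, assuming a hypothetical scalar-weight linear model with unit noise variance, prior precision `lam0`, and a Gaussian variational family with $\tilde T(\theta) = [1, \theta, \theta^2]$. For this model the log of the dropout target is exactly quadratic in $\theta$, so one step should already recover its coefficients up to Monte Carlo noise. The helper `solve` and all data values are illustrative.

```python
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def svb_estimate(x, y, lam0, drop_rate, mean, sd, n_samples, seed=0):
    """Evaluate eqs. (31a), (31b) and (29) once for a scalar-weight Gaussian model."""
    rng = random.Random(seed)
    g = [0.0, 0.0, 0.0]
    c = [[0.0] * 3 for _ in range(3)]
    for _ in range(n_samples):
        theta = rng.gauss(mean, sd)            # theta_s from the variational distribution
        log_p = -0.5 * lam0 * theta ** 2       # log q(theta_s), up to a constant
        for x_i, y_i in zip(x, y):
            kept = 0.0 if rng.random() < drop_rate else 1.0   # one dropped-out copy per observation
            log_p += -0.5 * (y_i - x_i * theta * kept) ** 2   # log q(x | theta~_s), up to a constant
        t = [1.0, theta, theta ** 2]           # extended sufficient statistics
        for i in range(3):
            g[i] += t[i] * log_p / n_samples
            for j in range(3):
                c[i][j] += t[i] * t[j] / n_samples
    return solve(c, g)


x = [0.5, -1.0, 1.5, 1.0]
y = [0.4, -1.1, 1.6, 0.9]
eta = svb_estimate(x, y, lam0=1.0, drop_rate=0.5, mean=0.5, sd=1.0, n_samples=40000)
post_mean = -eta[1] / (2.0 * eta[2])   # mean of the Gaussian with these natural parameters
```

Averaging the per-observation squared error over the mask gives the coefficients $\eta_1 = (1 - f) \sum_i x_i y_i$ and $\eta_2 = -\tfrac{1}{2}\left(\lambda_0 + (1 - f) \sum_i x_i^2\right)$ analytically, so the estimate can be checked against $\eta_1 = 2.3$ and $\eta_2 = -1.625$ for the values used here.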


5 Simulations 

To illustrate the effect of dropout we will consider two different models, a Bayesian
linear regression model and a logistic regression model. The former admits analytical
computation of the dropout target and allows us to study how dropout penalizes different
features; however, owing to its tractability, the derived dropout posterior could have
been obtained by appropriate normalization of the data.

The second example, logistic regression, does not allow a closed-form expression
and will be used to illustrate the analytical approximation and stochastic variational
Bayesian inference methods discussed in the previous section.


5.1 Bayesian linear regression with dropout 


To examine the properties of Bayesian dropout, we consider a simple Bayesian linear
regression model in which posterior inference is analytically tractable. Implementing
Bayesian dropout for linear regression amounts to choosing an appropriate
weight-corruption distribution and evaluating eq. (20). We choose the standard conjugate form
of the linear model and assume that the weights have zero mean (Gelman et al. 2003).
Accordingly the joint prior measure $q$ is defined by the generative process



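$$ \sigma^{-2} \sim \mathrm{Gamma}(a_0, b_0), \tag{32a} $$
$$ w \mid \sigma^2 \sim \mathrm{Normal}\left(0, \tfrac{\sigma^2}{\lambda_0} I\right), \tag{32b} $$
$$ \tilde w_i \mid w \sim q_f(\cdot \mid w), \tag{32c} $$
$$ y_i \mid \tilde w_i, \sigma^2 \sim \mathrm{Normal}\left(X_i^T \tilde w_i, \sigma^2\right), \tag{32d} $$

where $y$ is an $n$-dimensional vector of responses, and $w$ and $X_i$ are $p$-dimensional
vectors of weights and covariates. Compared to eq. (20), $\theta = (w, \sigma)$. For the
variance we consider the limit $a_0, b_0 \to 0$ corresponding to the Jeffreys prior,
$q(\sigma^2 \mid a_0, b_0) \propto 1/\sigma^2$. As a dropout distribution
$q_f(\tilde w_i \mid w)$ we consider independent binary dropout with probability $f$,

$$ q_f(\tilde w_i \mid w) = \prod_{d} \left[ (1 - f)\, \delta(\tilde w_{id} - w_d) + f\, \delta(\tilde w_{id}) \right]. \tag{33} $$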

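Computing the expectation of the log likelihood with respect to the dropout distribution
yields

$$ \langle \log q(y \mid \tilde w) \rangle = -\frac{n}{2} \log(2\pi\sigma^2)
 - \frac{1}{2\sigma^2} (y - \hat y)^T (y - \hat y)
 - \frac{f(1 - f)}{2\sigma^2}\, w^T \Lambda_1 w, \tag{34} $$

$$ \hat y = (1 - f) X w, \qquad \Lambda_1 = (X^T X) \circ I, \qquad X = [X_1, X_2, \ldots, X_n]^T, \tag{35} $$

where $X$ is an $n \times p$ matrix of covariates and $\circ$ is the Hadamard product. This is
combined with the conjugate prior for $w$,

$$ q(w \mid \sigma^2) = \left( \frac{\lambda_0}{2\pi\sigma^2} \right)^{p/2} \exp\left[ -\frac{\lambda_0}{2\sigma^2}\, w^T w \right], \tag{36} $$

to yield

$$ p(w, \sigma \mid y) \propto q(\sigma^2)\, (2\pi\sigma^2)^{-\frac{n + p}{2}}
 \exp\left\{ -\frac{1}{2\sigma^2} \left[ (w - \mu_n)^T \Lambda (w - \mu_n) + y^T y - \mu_n^T \Lambda \mu_n \right] \right\}, \tag{37} $$

$$ \Lambda = \lambda_0 I + f(1 - f) \Lambda_1 + (1 - f)^2 X^T X, \qquad \mu_n = (1 - f) \Lambda^{-1} X^T y. \tag{38} $$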


Note that this is equal to the ordinary Bayesian linear model in the limit $f = 0$, i.e.
when no weights are dropped. Furthermore, note that if the matrix of covariates is
normalized, $\Lambda_1$ is a scaled identity matrix and Bayesian dropout becomes identical to
ridge regression. Examining eq. (37), an alternative view of Bayesian dropout is as a
data-dependent prior that averts overconfidence under model misspecification as the data
size increases. As such, weights of larger features are regularized more heavily by dropout
than by ridge regression, as also discussed by Wang and Manning (2013). Also note that
removing the prior term $\lambda_0$ and taking the maximum likelihood of eq. (37) with respect
to $w$ while keeping $\sigma$ constant recovers the expression for learning with marginalized
corrupted features (Maaten et al. 2013).
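The closed form of eq. (38) is easy to evaluate directly. The sketch below does so for two covariates, assuming the reconstruction $\Lambda = \lambda_0 I + f(1 - f)\Lambda_1 + (1 - f)^2 X^T X$ and $\mu_n = (1 - f)\Lambda^{-1} X^T y$; the function name and the small, highly correlated data set are hypothetical.

```python
def dropout_posterior_mean(X, y, lam0, f):
    """Posterior mean mu_n = (1 - f) Lambda^{-1} X^T y of eq. (38) for two covariates."""
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(2)] for a in range(2)]
    xty = [sum(r[a] * t for r, t in zip(X, y)) for a in range(2)]
    lam = [[(1 - f) ** 2 * xtx[a][b] for b in range(2)] for a in range(2)]
    for a in range(2):
        lam[a][a] += lam0 + f * (1 - f) * xtx[a][a]   # lambda_0 I + f (1 - f) Lambda_1
    det = lam[0][0] * lam[1][1] - lam[0][1] * lam[1][0]
    inv = [[lam[1][1] / det, -lam[0][1] / det],
           [-lam[1][0] / det, lam[0][0] / det]]
    return [(1 - f) * sum(inv[a][b] * xty[b] for b in range(2)) for a in range(2)]


# Two nearly collinear covariates (illustrative values).
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
y = [2.0, 4.1, 6.1, 8.0]
w_plain = dropout_posterior_mean(X, y, lam0=0.0, f=0.0)
w_heavy = dropout_posterior_mean(X, y, lam0=0.0, f=0.999)
```

With `f=0.0` and `lam0=0.0` the result is the ordinary least-squares solution. With `lam0=0.0` the formula simplifies to $\mu_n = [f \Lambda_1 + (1 - f) X^T X]^{-1} X^T y$, so as $f$ approaches 1 the mean approaches $\Lambda_1^{-1} X^T y$, the estimate of each weight fitted in isolation.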


Dropout as parameter shrinkage 


One way to understand regularizers and Bayesian priors is that they improve generalization
performance by shrinking parameter estimates towards zero (or some other
value). From eq. (37) it is evident that in the linear regression model, when the prior
is $\lambda_0 = 0$, dropout shrinks parameters towards $\Lambda_1^{-1} X^T y$, corresponding to
the maximum likelihood estimate of each weight in isolation. Thus, dropout provides shrinkage
towards a solution with no co-adaptation. We illustrate this in the left panel of Figure 1
for a quantitative structure property relationship (QSPR) data example from the literature
(Wold 2001) with $n = 19$ observations and $p = 7$ highly correlated covariates.
Figure 1: Left panel: Comparison of dropout, ridge regression and Lasso for the quantitative
structure property relationship (QSPR) data. Right panel: Performance of ridge
regression and dropout in the three scenarios and three conditions.


We flipped the sign of one of the covariates (DGR) so that all covariates were positively
correlated with the response. In contrast to ridge regression and Lasso, dropout
shrinks the parameters towards an all-positive solution, as would be expected when all
covariates are positively correlated with the response.


Generalization performance 

Next, we examined the generalization performance of Bayesian linear regression with
dropout on simulated data (see the right panel of Figure 1). We generated $n = 20$ training
and test data from the normal linear regression model with $\lambda_0 = \sigma^2 = 1$ and plotted the
crossvalidation squared error averaged over 100 000 random data sets. In dropout we 
used an incorrect prior Aq = 10“^ to examine its performance under model misspecifica- 
tion. The covariates X were chosen as X = RL where R was a 20 x 10 standard normal 
i.i.d. random matrix and $L$ was a $10 \times p$ random projection matrix where each column
had unit length. We considered the overdetermined ($n > p = 10$), determined
($n = p = 20$), and underdetermined ($n < p = 40$) scenarios under three different
conditions. In the default condition, the response was generated directly from the model. In
this condition, ridge regression with the correct prior is optimal, and dropout was found
to perform on par. In the junk features condition, we set all but the first 10 weights to
zero when generating the data, corresponding to a misspecified prior or to having $p - 10$
Figure 2: Performance of Bayesian linear regression with L2 regularization (ridge
regression) and dropout on four binary prediction problems. Performance is measured by
the fraction of misclassified test data; lower is better.


noninformative covariates. Again performing on par, both ridge regression and dropout 
could counter this by increasing the amount of regularization. Finally, in the covariate 
shift condition we generated the data as in the default condition but multiplied each 
covariate by a normal random number before htting the models. We found that ridge 
regression was quite sensitive to this type of model mismatch whereas dropout appeared 
significantly more robust. 
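
The simulated design described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the code used for the experiments: the
entries of L are taken to be standard normal before each column is normalized,
$\lambda_0$ is read as the prior precision of the weights, and in the covariate shift
condition one normal factor per covariate is applied to X after the response has been
drawn. The names `make_design` and `make_data` are hypothetical.

```python
import math
import random


def make_design(n, p, rng, latent_dim=10):
    """Return X = R L and L, where R is n x latent_dim with i.i.d. N(0, 1)
    entries and L is latent_dim x p with every column scaled to unit length."""
    R = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(n)]
    # Assumption: L starts as a standard normal matrix before normalization.
    L = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(latent_dim)]
    for j in range(p):
        norm = math.sqrt(sum(L[k][j] ** 2 for k in range(latent_dim)))
        for k in range(latent_dim):
            L[k][j] /= norm
    X = [[sum(R[i][k] * L[k][j] for k in range(latent_dim)) for j in range(p)]
         for i in range(n)]
    return X, L


def make_data(n, p, condition, rng, prior_precision=1.0, noise_var=1.0):
    """Generate (X, y, w) for the 'default', 'junk' or 'shift' condition."""
    X, _ = make_design(n, p, rng)
    w = [rng.gauss(0.0, 1.0 / math.sqrt(prior_precision)) for _ in range(p)]
    if condition == "junk":
        # Only the first 10 weights are used to generate the response.
        w = [wj if j < 10 else 0.0 for j, wj in enumerate(w)]
    y = [sum(x * wj for x, wj in zip(row, w)) + rng.gauss(0.0, math.sqrt(noise_var))
         for row in X]
    if condition == "shift":
        # Assumption: one N(0, 1) factor per covariate, applied after y is drawn.
        scale = [rng.gauss(0.0, 1.0) for _ in range(p)]
        X = [[x * s for x, s in zip(row, scale)] for row in X]
    return X, y, w


rng = random.Random(0)
X, y, w = make_data(n=20, p=40, condition="junk", rng=rng)
```

Because X = RL has rank at most 10, the covariates are strongly collinear whenever
p > 10, which is the regime in which co-adaptation can occur.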


Finally, we evaluated Bayesian linear regression with dropout on the Amazon review 
dataset (Blitzer et al. 2007). The dataset consists of a bag-of-words representation of 
product reviews in four categories: Books, DVDs, Electronics, and Kitchen, and the task 
is to predict if the review is positive or negative. The number of observations ranges 
from 3 587 to 5 946 and the number of covariates from 123 099 to 193 220. We randomly 
selected 75% of the documents as training data and computed the predictive error on the 
remaining documents (see Figure 2). We computed the average and standard deviation
of the test error over 40 simulations for each data point for Bayesian linear regression
with and without dropout for varying values of the scale parameter $\lambda_0$. Here, dropout
significantly improved generalization performance for a wide range of parameter settings. 


5.2 Dropout for logistic regression 

The second model considered is the standard logistic regression model with a normal 
gamma prior on the weights and an inverse-gamma prior on the variance of the weights. 
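Accordingly, the generative process becomes

\begin{align}
\sigma^{-2} &\sim \mathrm{Gamma}(a, b), \tag{39a} \\
w_k \mid \sigma^{-2} &\sim \mathcal{N}(0, \sigma^2), \qquad k = 1, \ldots, p, \tag{39b} \\
y_i \mid w &\sim \mathrm{Bernoulli}(\sigma(X_i w)), \qquad i = 1, \ldots, n, \qquad \sigma(x) = (1 + \exp(-x))^{-1}. \tag{39c}
\end{align}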
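
As an illustration, the generative process of eq. (39) can be simulated directly. The
sketch below is not taken from the paper; it assumes that Gamma(a, b) is parameterized by
shape a and rate b (with a = b = 1 the rate and scale readings coincide), and the function
name `sample_model` and the toy design matrix are hypothetical.

```python
import math
import random


def logistic(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_model(X, a=1.0, b=1.0, rng=None):
    """Draw (precision, w, y) from eq. (39) for a design matrix X given as rows.

    Assumption: Gamma(a, b) means shape a and rate b, so the scale handed to
    random.gammavariate is 1 / b.
    """
    rng = rng or random.Random()
    precision = rng.gammavariate(a, 1.0 / b)        # (39a): sigma^{-2}
    sd = 1.0 / math.sqrt(precision)                 # prior standard deviation sigma
    p = len(X[0])
    w = [rng.gauss(0.0, sd) for _ in range(p)]      # (39b)
    y = []
    for row in X:
        activation = sum(x_ij * w_j for x_ij, w_j in zip(row, w))
        y.append(1 if rng.random() < logistic(activation) else 0)   # (39c)
    return precision, w, y


rng = random.Random(0)
# Toy design: the first column is fixed to 1 for the bias, the rest is N(0, 1).
X = [[1.0] + [rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
precision, w, y = sample_model(X, rng=rng)
```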




















For this model the dropout target appears analytically intractable for binary corruption,
and we will use it to illustrate the two approximate approaches introduced earlier.


Inference using analytical approximation 

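Let $\theta = (w, \sigma^{-2})$ and recall $X_{i1} = 1$ to account for the bias. Consider
again the binary feature corruption of eq. (33) with probability $f$, restricted to not
affect the bias term $w_1$ or $\sigma^{-2}$. For each fixed observation $i$, and
conditional on i.i.d. draws of the indicators $\gamma_j \mid f \sim \mathrm{Bernoulli}(1 - f)$,
so that feature $j$ is dropped ($\gamma_j = 0$) with probability $f$, we approximate the
random sum

$$X_i w = w_1 + \sum_{j=2}^{p} X_{ij} w_j \gamma_j \tag{40}$$

with a $\mathcal{N}(\mu_i, \sigma_i^2)$ distribution where

$$\mu_i = w_1 + (1 - f) \sum_{j=2}^{p} X_{ij} w_j, \qquad \sigma_i^2 = f(1 - f) \sum_{j=2}^{p} (X_{ij} w_j)^2. \tag{41}$$

The relevant quantity needed to approximate eq. (22) becomes

$$\mathbb{E}\left[\log \sigma(X_i w)\right], \tag{42}$$

where the expectation is taken under the normal approximation to $X_i w$.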
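
The first two moments in eq. (41) can be checked by simulating the corrupted sum of
eq. (40). The sketch below is a sanity check under stated assumptions rather than part of
the method: each non-bias feature is kept with probability 1 - f (the convention that
reproduces the mean in eq. (41)), and the observation, weights and function names are
hypothetical. It verifies the mean and variance only; how close the sum is to a normal
distribution depends on the number and relative size of the terms.

```python
import random


def dropout_moments(x_row, w, f):
    """Mean and variance of eq. (41); index 0 is the bias and is never dropped."""
    terms = [x * wj for x, wj in zip(x_row[1:], w[1:])]
    mu = x_row[0] * w[0] + (1.0 - f) * sum(terms)
    var = f * (1.0 - f) * sum(t * t for t in terms)
    return mu, var


def sample_corrupted_sum(x_row, w, f, rng):
    """One draw of eq. (40): every non-bias feature is kept with probability 1 - f."""
    total = x_row[0] * w[0]
    for x, wj in zip(x_row[1:], w[1:]):
        if rng.random() >= f:          # keep indicator gamma_j = 1
            total += x * wj
    return total


rng = random.Random(1)
x_row = [1.0, 0.5, -1.2, 2.0, 0.3]     # hypothetical observation with x_row[0] = 1
w = [0.2, 1.0, -0.5, 0.7, 1.5]         # hypothetical weights
f = 0.3

mu, var = dropout_moments(x_row, w, f)
draws = [sample_corrupted_sum(x_row, w, f, rng) for _ in range(100_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / len(draws)
```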

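While the expectation in eq. (42) does not have a closed-form solution, the logistic
function can be closely approximated by the error function (Crooks 2009), giving the
following approximation:

$$\sigma(x) \approx \frac{1}{2} + \frac{1}{2} \operatorname{erf}\left(\frac{\sqrt{\pi}}{4} x\right), \qquad \int dx\, \sigma(x)\, \mathcal{N}(x; \mu, s^2) \approx \sigma\left(\frac{\mu}{b_0}\right), \tag{43}$$

where $b_0 = \sqrt{1 + \pi s^2 / 8}$. This allows us to approximate the expectation of the
log of the logistic function as

\begin{align}
\frac{d}{d\mu} \mathbb{E}_{x \sim \mathcal{N}(\mu, s^2)}[\log \sigma(x)]
  &= \frac{d}{d\mu} \mathbb{E}_{x \sim \mathcal{N}(0, 1)}[\log \sigma(\mu + s x)] \tag{44} \\
  &= 1 - \mathbb{E}_{x \sim \mathcal{N}(\mu, s^2)}[\sigma(x)] \tag{45} \\
  &\approx 1 - \sigma\left(\frac{\mu}{b_0}\right), \tag{46}
\end{align}

which implies $\mathbb{E}[\log \sigma(x)] \approx \mu - b_0 \log\left(\exp(\mu / b_0) + 1\right) = b_0 \log \sigma(\mu / b_0)$.
The full likelihood eq. (20) is then simply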


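$$p(w, \sigma^{-2} \mid y, X) \propto q(w \mid \sigma^{-2})\, q(\sigma^{-2}) \prod_{i=1}^{n} \mathrm{Bernoulli}\left(y_i \,\middle|\, \sigma\left(\frac{\mu_i}{b_i}\right)\right)^{b_i}, \tag{47}$$

where

$$\mu_i = X_{i1} w_1 + (1 - f) \sum_{j=2}^{p} X_{ij} w_j, \qquad \sigma_i^2 = f(1 - f) \sum_{j=2}^{p} (X_{ij} w_j)^2 \qquad \text{and} \qquad b_i = \sqrt{1 + \frac{\pi \sigma_i^2}{8}}. \tag{48}$$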
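
The accuracy of the approximations in eqs. (43) to (46) can be examined numerically. The
sketch below compares them with a direct numerical integration for one hypothetical pair
of mean and standard deviation. It assumes the scaling $b_0 = \sqrt{1 + \pi s^2 / 8}$
associated with the error-function approximation of the logistic function; the helper
names and the chosen values are illustrative only.

```python
import math


def logistic(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def log_logistic(x):
    """log sigma(x), computed stably for both signs of x."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def gauss_expectation(g, mu, s, half_width=10.0, steps=4001):
    """E[g(x)] for x ~ N(mu, s^2) by the trapezoidal rule on mu +/- half_width * s."""
    lo = mu - half_width * s
    h = 2.0 * half_width * s / (steps - 1)
    total = 0.0
    for k in range(steps):
        x = lo + k * h
        weight = 0.5 if k in (0, steps - 1) else 1.0
        density = math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        total += weight * g(x) * density
    return total * h


mu, s = 0.8, 1.5                                   # hypothetical mean and std. deviation
b0 = math.sqrt(1.0 + math.pi * s ** 2 / 8.0)       # assumed scaling constant

exact_mean = gauss_expectation(logistic, mu, s)    # left-hand side of eq. (43)
approx_mean = logistic(mu / b0)                    # right-hand side of eq. (43)

exact_log = gauss_expectation(log_logistic, mu, s) # E[log sigma(x)]
approx_log = b0 * log_logistic(mu / b0)            # b0 * log sigma(mu / b0)
```

For s = 0 the constant reduces to $b_0 = 1$ and both approximations become exact, which
corresponds to the case without dropout.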

                   Australian   German   Heart   Pima   Ripley
Observations n        690        1000     270    532     250
Features p             15          25      14      8       7

Table 1: Datasets considered in the simulations


This closed-form expression can be sampled by any standard method for sampling
continuous densities. For our simulations we chose Hamiltonian Markov chain Monte
Carlo (Neal 2011) to sample $w$ and ordinary Metropolis-Hastings random-walk
sampling for $\sigma^{-2}$.
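
To make the sampling target concrete, the following sketch evaluates an unnormalized log
density of the form suggested by eqs. (47) and (48) and performs a random-walk
Metropolis-Hastings update of the precision. It is an illustration under several
assumptions and not the implementation used in the simulations: the likelihood term is
taken to be $b_i \log \sigma(\pm \mu_i / b_i)$ with $b_i = \sqrt{1 + \pi \sigma_i^2 / 8}$,
Gamma(a, b) is read as shape and rate, and the names `log_target` and
`mh_step_precision` as well as the toy data are hypothetical.

```python
import math
import random


def log_logistic(x):
    """log sigma(x), computed stably for both signs of x."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_target(w, precision, X, y, f, a=1.0, b=1.0):
    """Unnormalized log of the approximate dropout target (assumed form of eq. 47)."""
    p = len(w)
    # Gamma(a, b) prior on the precision (shape a, rate b assumed).
    lp = (a - 1.0) * math.log(precision) - b * precision
    # Independent N(0, sigma^2) prior on every weight, including the bias.
    lp += 0.5 * p * math.log(precision) - 0.5 * precision * sum(wj * wj for wj in w)
    for row, yi in zip(X, y):
        terms = [x * wj for x, wj in zip(row[1:], w[1:])]
        mu = row[0] * w[0] + (1.0 - f) * sum(terms)
        var = f * (1.0 - f) * sum(t * t for t in terms)
        b_i = math.sqrt(1.0 + math.pi * var / 8.0)
        z = mu / b_i
        lp += b_i * (log_logistic(z) if yi == 1 else log_logistic(-z))
    return lp


def mh_step_precision(w, precision, X, y, f, rng, proposal_var=0.1):
    """One random-walk Metropolis-Hastings update of sigma^{-2}."""
    proposal = precision + rng.gauss(0.0, math.sqrt(proposal_var))
    if proposal <= 0.0:
        return precision                       # outside the support: reject
    log_ratio = log_target(w, proposal, X, y, f) - log_target(w, precision, X, y, f)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return precision


rng = random.Random(2)
X = [[1.0, 0.5, -1.0], [1.0, -0.3, 2.0], [1.0, 1.2, 0.4]]   # toy data, bias in column 0
y = [1, 0, 1]
w = [0.1, 0.4, -0.2]
precision = 1.0
for _ in range(200):
    precision = mh_step_precision(w, precision, X, y, f=0.3, rng=rng)
```

With f = 0 every $b_i$ equals one and `log_target` reduces to the ordinary unnormalized
log posterior of the logistic regression model, which gives a simple consistency check.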


Inference using stochastic variational Bayes 

For stochastic variational Bayes we applied the algorithm described earlier. As the
variational distribution $r_\eta(\theta) = \exp[T(\theta)\eta]$ we chose a product of $p$
normal distributions, one for each coordinate of $w$, i.e. for each dimension $k$:

$$r_{\eta_k}(w_k) = \exp[T^{(k)}(w_k)\,\eta_k], \qquad T^{(k)}(w) = [w, w^2], \tag{49}$$

where it is assumed $\eta_k = [\eta_{k1}, \eta_{k2}]^\top$. For the precision we chose a
gamma distribution as the variational distribution,

$$r_{\eta_0}(\sigma^{-2}) = \exp[T^{(0)}(\sigma^{-2})\,\eta_0], \qquad T^{(0)}(\sigma^{-2}) = [-2 \log \sigma, \sigma^{-2}], \tag{50}$$

and the full variational distribution for $\theta = (w, \sigma^{-2})$ is then
$r_\eta(\theta) = r_{\eta_0}(\sigma^{-2}) \prod_{k=1}^{p} r_{\eta_k}(w_k)$.

Although this method is sufficient, we found the precision $\sigma^{-2}$ had a strong
influence on the scale of the likelihood and could give high variance in the estimation
of eq. (31a), affecting convergence. This behaviour was consistently observed for the same
model without dropout and seems inherent to our application of stochastic variational
Bayes. To reduce this source of variance, we considered a version of the algorithm where,
instead of evaluating the expectations of eq. (31) by stochastically sampling from the
variational distribution, we fixed the values of $\sigma^{-2}$ to lie on a grid and
approximated the integral by a sum weighted by the likelihood of the variational
distribution. We fixed the grid to contain 100 equidistant points on the support of the
variational distribution. This Rao-Blackwellization considerably reduced the variance and
led to a much more stable procedure; however, the method still showed some dependency on
the initialization.
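
The grid-based replacement for sampling the precision can be sketched as follows. This is
a minimal illustration and not the procedure used in the simulations: since a gamma
distribution has unbounded support, the grid is assumed here to cover the mean plus or
minus six standard deviations (clipped to stay positive), and the function names and
parameter values are hypothetical.

```python
import math


def gamma_log_density(x, shape, rate):
    """Log density of a Gamma(shape, rate) distribution at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def grid_expectation(g, shape, rate, num_points=100, width=6.0):
    """Approximate E[g(x)], x ~ Gamma(shape, rate), by a weighted sum on a grid.

    Assumption: the grid spans mean +/- width standard deviations, clipped to
    stay strictly positive; the weights are the normalized variational density.
    """
    mean = shape / rate
    sd = math.sqrt(shape) / rate
    lo = max(mean - width * sd, 1e-6 * mean)
    hi = mean + width * sd
    step = (hi - lo) / (num_points - 1)
    grid = [lo + k * step for k in range(num_points)]
    log_w = [gamma_log_density(x, shape, rate) for x in grid]
    largest = max(log_w)
    weights = [math.exp(v - largest) for v in log_w]   # rescaled to avoid underflow
    total = sum(weights)
    return sum(wt * g(x) for wt, x in zip(weights, grid)) / total


shape, rate = 20.0, 10.0
approx_mean = grid_expectation(lambda x: x, shape, rate)   # exact value is shape / rate
```

The estimate is deterministic, so it adds no Monte Carlo variance to the gradient
estimates; its accuracy relies on the variational density being negligible at both ends
of the grid, which is less well satisfied for small shape parameters.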


5.3 Results 

The analytical and stochastic variational Bayes approximations were applied and
compared on five standard datasets from Bache and Lichman (2013), the dimensions of
which can be seen in Table 1. Each of the five datasets was transformed to have zero
mean and unit variance along each dimension, and the target y was binary. We considered
three versions of each dataset obtained by augmenting the dataset with J = 0, J = 50
and J = 100 columns of "junk" dimensions drawn i.i.d. from a normal distribution with
unit variance; this was done to ensure Bayesian dropout could be tested in a regime
where co-adaptation was possible.
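
The preprocessing described above can be sketched as follows. The helper names are
hypothetical, and the sketch assumes that the junk columns are appended after
standardization and that the constant bias column is added separately, since a constant
column cannot be scaled to unit variance.

```python
import random
import statistics


def standardize(X):
    """Scale every column of X (a list of rows) to zero mean and unit variance.

    Assumption: no column is constant, otherwise its standard deviation is zero.
    """
    columns = list(zip(*X))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]


def add_junk(X, J, rng):
    """Append J i.i.d. N(0, 1) columns that carry no information about the target."""
    return [list(row) + [rng.gauss(0.0, 1.0) for _ in range(J)] for row in X]


rng = random.Random(3)
raw = [[rng.gauss(5.0, 2.0), rng.gauss(-1.0, 0.5)] for _ in range(50)]   # toy features
augmented = add_junk(standardize(raw), J=50, rng=rng)
```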


The evaluation was performed by randomly splitting each dataset into a training set
(80% of the observations) and a test set (20% of the observations) and evaluating the two
inference methods on the logistic regression model of eq. (39), where the parameters of
the Gamma prior (see eq. (39a)) for $\sigma^{-2}$ were set to a = b = 1. Notice this
choice favors small values of $\sigma$, corresponding to high regularization. For the two
inference methods we used the following settings. For the analytical approximation, we
applied Hamiltonian MCMC (Neal 2011) to the target density of eq. (47). The parameter
values were adapted as conservative versions of those used by Girolami and Calderhead
(2011), and we used the same settings for all simulations. In the notation of Girolami
and Calderhead (2011) we chose path length L = 20, $\alpha = 100$, a mass tensor equal to
the identity matrix and a stepsize of 0.8. For $\sigma^{-2}$ we chose Metropolis-Hastings
with a normal distribution of variance 0.1 as proposal kernel. The HMCMC method was
run for 25000 iterations, with the first 5000 discarded as burn-in. For the stochastic
variational Bayesian inference scheme we applied the previously described procedure
with N = 20000, e = ^ and ^ = 100. 


For both procedures we evaluated the AUC score on the test data for each dataset
(with a particular number of junk features J = 0, 50, 100) and for varying dropout rate
$f$, with $f = 0$ corresponding to no dropout, i.e. in the case of HMCMC exact inference
in the true model. This was repeated for 40 random splits into training and test data,
and the mean predictive performance and the variance of the mean are reported in
Figure 3. By re-running the methods several times on the same split we found that the
variance primarily reflects variance in the test/training splits and not in the inference
methods. With no junk features, the two inference methods give nearly identical
predictive performance and show little or no effect of dropout (weakly negative for 3
datasets, neutral or weakly positive for the other 2). For a larger number of junk
features, dropout led to increased performance using both inference methods. Notice
again that $f = 0$ for the analytical approximation (red) corresponds to standard Bayesian
inference for the logistic regression model, and dropout shows significant gains compared
to this value. For smaller values of the dropout rate ($f < 0.2$) and on all datasets the
HMCMC inference scheme seems to give the better performance, and for larger values
of $f$ it gives the better performance in nearly all cases. More importantly, both
inference methods seem to support the utility of dropout in the case of ill-posed problems.
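
The evaluation protocol can be summarized in code. The sketch below shows one way to
compute the AUC score from predicted scores, to draw an 80%/20% split and to summarize
repeated splits by the mean and the variance of the mean. It is an illustration only; the
helper names are hypothetical and the rank-based AUC formula is a standard choice rather
than something specified in the text.

```python
import random
import statistics


def auc_score(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formula, ties averaged."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = 0.5 * (i + j) + 1.0          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, label in zip(ranks, labels) if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def train_test_split(n, test_fraction, rng):
    """Return (train_indices, test_indices) for a random split of n observations."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(test_fraction * n)
    return indices[n_test:], indices[:n_test]


def mean_and_variance_of_mean(values):
    """Mean of the values and the estimated variance of that mean."""
    return statistics.fmean(values), statistics.variance(values) / len(values)


rng = random.Random(4)
train_idx, test_idx = train_test_split(250, 0.2, rng)
example_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```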


Figure 3: Performance of the analytical approximation (red) and stochastic variational
Bayes (blue) on the Bayesian logistic regression model of eq. (39) as applied to the five
classification problems of Table 1. The three columns correspond to adding
J = 0, 50, 100 i.i.d. normal "junk" features to the data to make the problem more
ill-posed, thereby reducing predictive performance. The figure shows the mean and the
variance of the mean of the AUC score, computed over 40 splits into 20%/80% test/training
data, as a function of the dropout rate $f$. Notice that $f = 0$ (no dropout) for the red
curve (the analytical approximation) corresponds to ordinary Bayesian inference. As can be
seen, dropout leads to significant gains in performance for the more ill-posed problems.


6 Conclusion

Dropout provides a simple yet powerful tool to avoid co-adaptation in neural networks
and has been shown to offer tangible benefits. However, its formulation as an algorithm
rather than as a set of probabilistic assumptions precludes its use in Bayesian modelling.
We have shown how dropout can be interpreted as optimal inference under a particular
constraint. This qualifies dropout beyond being a particular optimization procedure,
and has the advantage of giving researchers who want to apply dropout to a particular
model a principled way to do so.

We have demonstrated Bayesian dropout on an analytically tractable regression 
model, providing a probabilistic interpretation of its mechanisms for regularizing and 
preventing co-adaptation as well as its connection to other Bayesian techniques. In our 
experiments we find that dropout can provide robustness under model misspecification, 
and offer benefits over ordinary Bayesian linear regression on a real dataset. We also
discussed two schemes which allow dropout to be applied in a wider setting: one based
on an analytical approximation to the dropout target, the other based on stochastic
variational Bayes which, by only requiring an unbiased estimator of the true dropout 
target, seems nearly ideally suited for dropout. When these techniques were applied to a 
Bayesian logistic regression problem we found stochastic variational Bayes to have some 
significant convergence difficulties; notice these were also found for ordinary Bayesian 
logistic regression without dropout and require further investigation. By increasing 
the effort of stochastic variational Bayes we arrived at estimates which showed good 
qualitative agreement with the analytical approximation as evaluated by Hamiltonian 
Markov chain Monte Carlo. Both approximations showed dropout to have little or no
effect in the well-specified regime; however, when the number of spurious features was
increased, dropout led to large increases in performance.

In a larger scope, we believe that viewing probabilistic modelling as consisting not only
of specifying a uniquely optimal model, but also of posing general restrictions on the
model class, provides an important departure from the existing Bayesian paradigm.
Establishing whether this is ultimately true, however, requires that the method
demonstrate its versatility, and we see the present formalism of dropout as a single step
on this path.